QA and Assembly
bmpvieira.com/assembly14

Bruno Vieira | @bmpvieira
PhD Student @
Bioinformatics and Population Genomics
Supervisor:
Yannick Wurm | @yannick__
Download data
bit.ly/ant-reads
Useful books
Papers
De novo genome assembly: what every biologist should know
Assemblathon 2: evaluating de novo methods of genome assembly[...]
Genome Assembly

Chen 2011
Types
Algorithms
- Overlap Layout Consensus
- De Bruijn
Strategies
Assembly paradigms
Overlap/Layout/Consensus

- A node corresponds to a read, an edge denotes an overlap between two reads.
- The overlap graph is used to compute a layout of reads and consensus sequence of contigs by pair-wise sequence alignment.
- Good for data sets with a limited number of reads but significant overlap. Computationally intensive for short-read data (short reads, high error rate).
- Example assemblers: Celera Assembler, Arachne, CAP and PCAP
Chen 2011
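As a rough illustration of the overlap step described above, here is a minimal sketch that builds an overlap graph by comparing every ordered pair of reads. The function names, the `min_overlap` threshold and the toy reads are assumptions made for this example; real OLC assemblers use alignment that tolerates errors rather than exact suffix/prefix matching.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that exactly matches a prefix of `b`."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Nodes are reads; an edge (a, b) stores the overlap length between a and b."""
    graph = {}
    # All-against-all comparison: this is the costly part for large read sets.
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_overlap)
        if length:
            graph[(a, b)] = length
    return graph


reads = ["ATGCGT", "GCGTAC", "GTACCA"]  # toy reads, invented for illustration
graph = build_overlap_graph(reads)
for (a, b), length in graph.items():
    print(a, "->", b, "overlap", length)
```

The nested comparison grows quadratically with the number of reads, which is why this approach becomes expensive for large short-read data sets.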
de Bruijn

- No need for all against all overlap discovery.
- Break reads into smaller sequences of DNA (K-mers, K denotes the length in bases of these sequences).
- Captures overlaps of length K-1 between these K-mers.
- More sensitive to repeats and sequencing errors.
- By construction, the graph contains a path corresponding to the original sequence.
- Example assemblers: Euler, Velvet, ABySS, AllPaths, SOAPdenovo, CLC Bio
Chen 2011
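The k-mer construction above can be sketched in a few lines. This is a simplified illustration, not how any of the listed assemblers is implemented: nodes are k-mers, edges link k-mers that are adjacent in a read (and therefore overlap by K-1 bases), and the function names and toy reads are assumptions made for the example.

```python
from collections import defaultdict


def kmers(read, k):
    """All substrings of length k, in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Link each k-mer to the k-mer that follows it in a read (overlap of k-1)."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph


def walk(graph, start):
    """Follow unambiguous edges from `start` and spell out the sequence."""
    path = [start]
    node = start
    while len(graph.get(node, ())) == 1:
        node = next(iter(graph[node]))
        if node in path:  # stop on a cycle (e.g. a repeat)
            break
        path.append(node)
    return path[0] + "".join(n[-1] for n in path[1:])


reads = ["ATGCG", "GCGTA"]  # toy reads, invented for illustration
graph = build_de_bruijn(reads, 3)
print(walk(graph, "ATG"))
```

No read is compared against another read here: the two reads are joined only because they share the k-mer `GCG`. The same property explains the sensitivity to repeats and errors, since a repeated or erroneous k-mer adds branches that the simple `walk` cannot resolve.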

Schatz 2012

Schatz 2012
Too many assemblers
seqanswers.com/wiki/De-novo_assembly
A5, ABySS, ALLPATHS, CABOG, CLCbio, Contrail, Curtain, DecGPU, Forge, Geneious, GenoMiner, IDBA, Lasergene, MIRA, Newbler, PE-Assembler, QSRA, Ray, SeqMan NGen, SeqPrep, Sequencher, SHARCGS, SHORTY, SHRAP, SOAPdenovo, SR-ASM, SuccinctAssembly, SUTTA, Taipan, VCAKE, Velvet
Benchmarking
Why we need the assemblathon
Assembly quality assessment
Accuracy or “Correctness”
- Base accuracy – the frequency of calling the correct nucleotide at a given position in the assembly.
- Mis-assembly rate – the frequency of rearrangements, significant insertions, deletions and inversions.
Assembly quality assessment
Continuity
- Length distribution of contigs/scaffolds.
- Average length, minimum and maximum lengths, combined total lengths.
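- N50 captures how much of the assembly is covered by relatively large contigs: it is the length of the shortest contig in the set of longest contigs that together make up at least half of the total assembly length.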
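A minimal sketch of how N50 can be computed from a list of contig lengths. The function name and the example lengths are assumptions made for illustration.

```python
def n50(lengths):
    """Contig length at which the running total, taken from the longest
    contig downwards, first reaches half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


contig_lengths = [80, 70, 50, 40, 30, 20]  # invented example, total = 290
print(n50(contig_lengths))  # 80 + 70 = 150 >= 145, so this prints 70
```

N50 says nothing about correctness: joining contigs wrongly raises N50, so it should be read together with the accuracy measures.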

Assembly quality assessment
- **Fragment analysis** - Count how many randomly chosen fragments from the genome of species A can be found in the assembly
- **Repeat analysis** - Choose fragments that either overlap or don’t overlap a known repeat
- **Gene finding** - How many genes are present in each assembly? ([CEGMA](http://korflab.ucdavis.edu/datasets/cegma/#SCT2))
Assembly quality assessment
- **Contamination** - “all libraries will contain some bacterial contamination”
- **Mauve analysis** - Uses whole genome alignment to reveal
- **BWA analysis** - Align contigs to genome
- **[Optical Maps](http://en.wikipedia.org/wiki/Optical_mapping) / [Irys](http://www.bionanogenomics.com/technology/why-genome-mapping/)**


FastQC Documentation


"(...)systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors."
"(...)reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs."
Magic? No, Bloom filters
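To make the idea behind the quotes concrete, here is a minimal sketch of digital normalization: a read is kept only if the median count of its k-mers, among the reads kept so far, is still below a coverage cutoff. This is an illustration under stated assumptions. The function name, `k`, `cutoff` and the toy reads are invented, and an exact `Counter` stands in for the memory-efficient probabilistic counting structure that real tools use.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its estimated coverage is below `cutoff`."""
    counts = Counter()  # assumption: exact counts instead of a probabilistic sketch
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k
        # Median k-mer count is a coverage estimate that tolerates a few errors.
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


reads = ["ATGCGTAC"] * 5 + ["TTTTGGGG"]  # toy data: one over-sampled region
print(digital_normalization(reads))
```

With these toy reads, the over-sampled read is kept twice and then discarded, while the rarely seen read is kept. This is the "discarding redundant data" effect described in the quote.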


What is digital normalization, anyway?
Why you shouldn't use digital normalization
FASTA

FASTQ



Practical
bmpvieira.com/assembly14-practical